Author: Ray Wu

Bringing in Rectangular data

First, we load the gapminder and tidyverse packages:

library(gapminder)
library(tidyverse)
## Note: the specification for S3 class "difftime" in package 'lubridate' seems equivalent to one from package 'hms': not turning on duplicate class definitions for this class.
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Factor Management

Drop Oceania

First, let’s take a look at the dataset:

gapminder
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows

Our goal is to drop the ‘Oceania’ level from the ‘continent’ factor.

Let’s take a look at the rows that will be removed when we drop Oceania:

gapminder %>% 
  filter(continent == 'Oceania')
## # A tibble: 24 x 6
##    country   continent  year lifeExp      pop gdpPercap
##    <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Australia Oceania    1952    69.1  8691212    10040.
##  2 Australia Oceania    1957    70.3  9712569    10950.
##  3 Australia Oceania    1962    70.9 10794968    12217.
##  4 Australia Oceania    1967    71.1 11872264    14526.
##  5 Australia Oceania    1972    71.9 13177000    16789.
##  6 Australia Oceania    1977    73.5 14074100    18334.
##  7 Australia Oceania    1982    74.7 15184200    19477.
##  8 Australia Oceania    1987    76.3 16257249    21889.
##  9 Australia Oceania    1992    77.6 17481977    23425.
## 10 Australia Oceania    1997    78.8 18565243    26998.
## # ... with 14 more rows

Oceania accounts for 24 rows, and each country has 12 years of data, so after the modification we should have 24 fewer rows or, equivalently, 2 fewer countries.
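
The arithmetic above can be checked directly. This is only a sketch of one way to do it; the variable names (oceania, n_rows, n_countries, expected_rows_after) are my own and not part of the assignment.

```r
library(gapminder)
library(dplyr)

# Rows that belong to Oceania
oceania <- gapminder %>% filter(continent == "Oceania")

n_rows      <- nrow(oceania)                 # rows that will be removed
n_countries <- n_distinct(oceania$country)   # countries that will be removed

# Number of rows we expect to keep after dropping Oceania
expected_rows_after <- nrow(gapminder) - n_rows

c(rows = n_rows, countries = n_countries, rows_after = expected_rows_after)
```

The last value can then be compared against the row count of the filtered data frame.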

We can also get more information about the dataset as follows:

gapminder %>% 
  str()
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Actually dropping Oceania

(gapminder_no_oceania = gapminder %>% 
  filter(continent != 'Oceania'))
## # A tibble: 1,680 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,670 more rows

Let’s check the modified factor:

gapminder_no_oceania$continent %>% 
  levels()
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

We still have Oceania! We need to call the droplevels() function to actually drop Oceania.

(gapminder_no_oceania = gapminder_no_oceania %>% 
  droplevels())
## # A tibble: 1,680 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,670 more rows
gapminder_no_oceania$continent %>% 
  levels()
## [1] "Africa"   "Americas" "Asia"     "Europe"

Now, we see that Oceania is actually gone for good.
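
To see the same behaviour in isolation, here is a minimal sketch on a toy data frame. The values are hypothetical and not taken from gapminder. It shows that subsetting rows keeps every original level, and that droplevels() on a data frame cleans up all factor columns at once.

```r
# Toy data frame with two factor columns (hypothetical values)
toy <- data.frame(
  country   = factor(c("Japan", "France", "Australia")),
  continent = factor(c("Asia", "Europe", "Oceania"))
)

# Removing rows keeps every original level, even those with no rows left
toy_small <- toy[toy$continent != "Oceania", ]
levels(toy_small$continent)   # "Asia" "Europe" "Oceania"
table(toy_small$continent)    # Oceania is still listed, with a count of 0

# droplevels() removes unused levels from all factor columns
toy_clean <- droplevels(toy_small)
levels(toy_clean$continent)   # "Asia" "Europe"
levels(toy_clean$country)     # "France" "Japan"
```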

Next, we create a smaller version of the dataset (filtered down to data from 2002) to use for reordering factors and for reading/writing from the disk:

gapminder_2002 = gapminder %>% 
  filter(year == 2002)
(gapminder_asia_2002 = gapminder_2002 %>% 
  filter(continent == 'Asia'))
## # A tibble: 33 x 6
##    country          continent  year lifeExp        pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>      <int>     <dbl>
##  1 Afghanistan      Asia       2002    42.1   25268405      727.
##  2 Bahrain          Asia       2002    74.8     656397    23404.
##  3 Bangladesh       Asia       2002    62.0  135656790     1136.
##  4 Cambodia         Asia       2002    56.8   12926707      896.
##  5 China            Asia       2002    72.0 1280400000     3119.
##  6 Hong Kong, China Asia       2002    81.5    6762476    30209.
##  7 India            Asia       2002    62.9 1034172547     1747.
##  8 Indonesia        Asia       2002    68.6  211060000     2874.
##  9 Iran             Asia       2002    69.5   66907826     9241.
## 10 Iraq             Asia       2002    57.0   24001816     4391.
## # ... with 23 more rows

Factor reordering

First, here is the plot before we reorder the factor levels:

gapminder_asia_2002 %>% 
  ggplot(aes(pop, country)) + 
  geom_point() + 
  scale_x_log10() + 
  ggtitle('log(Population) of Asian Countries, 2002')

It’s pretty difficult to get any sense of ordering from this graph.

Now we will reorder the levels and re-make this plot:

gapminder_asia_2002 %>% 
  mutate(country = fct_reorder(country, pop, .fun=median)) %>% 
  ggplot(aes(pop, country)) + 
  geom_point() + 
  scale_x_log10() + 
  ggtitle('log(Population) of Asian Countries, 2002')

This is clearly a much better graph, as both the extreme points and the overall distribution are much easier to see.

arrange

It seems that we should be able to do the same thing with arrange(). After all, we are only sorting the data before plotting.

gapminder_asia_2002 %>% 
  arrange(pop)  %>% 
  ggplot(aes(pop, country)) + 
  geom_point() + 
  scale_x_log10() + 
  ggtitle('log(Population) of Asian Countries, 2002')

This does not work because arrange() does not change the factor levels, which determine the order on the axis. We are reordering the rows in the table, but the categories are still plotted alphabetically.

fct_reorder(), on the other hand, reorders the levels of the country factor by population (in ascending order by default). Hence, the plot with fct_reorder() is different because the first level, drawn at the bottom of the axis, is the country with the smallest population instead of the first country alphabetically, and the most populous country ends up at the top.
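
The difference can be seen without plotting by looking at the levels directly. The following is a small sketch using three of the 2002 populations from the table above; the vectors country and pop are my own toy objects.

```r
library(forcats)

country <- factor(c("China", "Bahrain", "India"))
pop     <- c(1280400000, 656397, 1034172547)  # 2002 values from the table above

# Levels start out alphabetical
levels(country)                      # "Bahrain" "China" "India"

# Sorting the values (what arrange() does to rows) leaves the levels untouched
levels(country[order(pop)])          # "Bahrain" "China" "India"

# fct_reorder() reorders the levels by pop, ascending by default
levels(fct_reorder(country, pop))    # "Bahrain" "India" "China"
```

Passing .desc = TRUE to fct_reorder() would reverse the order if the largest value should come first.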

file i/o

We will demonstrate file i/o with the gapminder_2002 data frame.

write_csv()/read_csv()

write_csv(gapminder_2002, 'gapminder_2002.csv')

Confirm that the file exists:

list.files(pattern = "gapminder_2002.csv")
## [1] "gapminder_2002.csv"

We see that the file gapminder_2002.csv exists, so we know that write_csv() worked as intended.
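
One caveat: the pattern argument of list.files() is a regular expression, so an unescaped "." matches any character. A stricter check, sketched here on a throwaway file in a temporary directory (the file name example_2002.csv is hypothetical), is file.exists() or an anchored and escaped pattern.

```r
# Create a throwaway file so the example does not depend on the working directory
path <- file.path(tempdir(), "example_2002.csv")
writeLines("a,b", path)

# Exact check on a single path
file.exists(path)

# Regex check with the dot escaped and the name anchored at both ends
list.files(tempdir(), pattern = "^example_2002\\.csv$")
```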

read_data = read_csv('gapminder_2002.csv')
## Parsed with column specification:
## cols(
##   country = col_character(),
##   continent = col_character(),
##   year = col_integer(),
##   lifeExp = col_double(),
##   pop = col_integer(),
##   gdpPercap = col_double()
## )
read_data %>% str()
## Classes 'tbl_df', 'tbl' and 'data.frame':    142 obs. of  6 variables:
##  $ country  : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
##  $ continent: chr  "Asia" "Europe" "Africa" "Africa" ...
##  $ year     : int  2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 ...
##  $ lifeExp  : num  42.1 75.7 71 41 74.3 ...
##  $ pop      : int  25268405 3508512 31287142 10866106 38331121 19546792 8148312 656397 135656790 10311970 ...
##  $ gdpPercap: num  727 4604 5288 2773 8798 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 6
##   .. ..$ country  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ continent: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ year     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ lifeExp  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ pop      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ gdpPercap: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

We don’t see a factor anywhere: country and continent were read back as character columns. A CSV file is plain text, so the factor structure is not preserved after writing to it. We will see a better method to do this in the next section.
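
If the data has to stay in CSV, one possible workaround is to declare the column types when reading. This is a sketch on a toy data frame with hypothetical values, not something done in the original assignment. As far as I know, col_factor(levels = NULL) infers the levels in order of appearance, so a custom level order would still be lost.

```r
library(readr)

# Hypothetical toy data standing in for gapminder_2002
toy <- data.frame(
  country   = factor(c("Japan", "France")),
  continent = factor(c("Asia", "Europe")),
  pop       = c(100, 200)
)

path <- tempfile(fileext = ".csv")
write_csv(toy, path)

# Default behaviour: the factor columns come back as character
plain <- read_csv(path)
class(plain$country)

# Declaring the column types brings the factors back;
# levels = NULL asks readr to infer the levels from the data
typed <- read_csv(
  path,
  col_types = cols(
    country   = col_factor(levels = NULL),
    continent = col_factor(levels = NULL),
    pop       = col_double()
  )
)
class(typed$country)
```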

saveRDS()/readRDS()

gapminder_2002 %>% saveRDS('gapminder_2002.rds')

Check to make sure that the file exists:

list.files(pattern = "gapminder_2002.rds")
## [1] "gapminder_2002.rds"

As expected, the file exists.

Now, read in the file again:

rds_file = readRDS('gapminder_2002.rds')

No errors! That’s a good start. Now let’s check the dataset description:

rds_file %>% str()
## Classes 'tbl_df', 'tbl' and 'data.frame':    142 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
##  $ year     : int  2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 ...
##  $ lifeExp  : num  42.1 75.7 71 41 74.3 ...
##  $ pop      : int  25268405 3508512 31287142 10866106 38331121 19546792 8148312 656397 135656790 10311970 ...
##  $ gdpPercap: num  727 4604 5288 2773 8798 ...

As expected, we do not encounter any problems with reading in the .rds file. In particular, we note that country and continent are factors as expected.

(Note that gapminder_2002 was filtered from the original gapminder data frame, not from gapminder_no_oceania, so we still have 142 countries and 5 continents.)
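
A compact way to confirm that nothing is lost in an RDS round trip is identical(). The sketch below uses a toy data frame with a deliberately non-alphabetical level order (hypothetical values) to show that even custom level orders survive.

```r
# Toy data frame whose factor has a custom, non-alphabetical level order
toy <- data.frame(
  continent = factor(c("Asia", "Europe"), levels = c("Europe", "Asia")),
  pop       = c(100, 200)
)

path <- tempfile(fileext = ".rds")
saveRDS(toy, path)
restored <- readRDS(path)

identical(toy, restored)       # TRUE: classes, levels and values all match
levels(restored$continent)     # "Europe" "Asia"
```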

Part 3

I am going to re-make a plot I handed in for assignment 2:

Original:

ggplot(gapminder, aes(continent)) + 
  geom_bar(fill = 'dark green')

Let’s see how we can improve this:

- the count on the y-axis is unclear. It looks like a number of countries, but geom_bar() is actually counting rows (one per country per year), and this is not apparent from the axis
- we should give a title
- entries from different years are all mixed together. It is hard to imagine this being useful.
- it could be more colourful, although the current scheme is readable

Instead, I am going to do the following:

- contrast the total population of the 5 continents
- express these values as percentages in order to make it easy to see which continents have increased and decreased their proportion of world population
- give the graph a meaningful title
- use colour, or some other meaningful encoding, to contrast the change
- separate the years

First, let’s calculate the sum of population for each continent in each year:

plot_data = gapminder %>% 
  group_by(continent,year) %>% 
  summarize(totalPop = sum((as.double(pop)))) # we need this to prevent integer overflow 
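
To illustrate why the conversion to double matters, here is a small sketch with two of the 2002 populations from the table above (the vector pop is a toy object of my own). Their sum exceeds the largest value an R integer can hold.

```r
# Largest value an R integer can hold
.Machine$integer.max

# China and India in 2002 (values from the table above), stored as integers
pop <- c(1280400000L, 1034172547L)

# Summing integers past the limit gives NA (with a warning, suppressed here)
suppressWarnings(sum(pop))

# Converting to double first gives the correct total
sum(as.double(pop))
```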

We also need to get the world population for each point in time, which the following code block does:

plot_data = plot_data %>%
  group_by(year) %>% 
  mutate(popRatio = totalPop/sum(totalPop))
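
A quick sanity check for this step is that the ratios within each year add up to 1. The sketch below runs the same group_by()/mutate() pattern on a toy table with hypothetical values; the names toy and check are my own.

```r
library(dplyr)

# Hypothetical totals for two continents in two years
toy <- tibble(
  continent = c("A", "B", "A", "B"),
  year      = c(1952, 1952, 1957, 1957),
  totalPop  = c(30, 70, 45, 55)
)

# Same pattern as above: divide by the yearly total
toy <- toy %>%
  group_by(year) %>%
  mutate(popRatio = totalPop / sum(totalPop)) %>%
  ungroup()

# Within each year the ratios should sum to 1
check <- toy %>%
  group_by(year) %>%
  summarize(total = sum(popRatio))
check
```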

Finally, we generate a stacked-area graph, which allows us to accurately visualize the proportion of categories over time; in this case, it is the continents and how their population progresses as a proportion of the world population.

(improved_graph = plot_data %>%
  ggplot(aes(year, popRatio, fill=continent)) + 
  geom_area(position = 'stack') + 
  xlab('year') +
  ylab('Proportion of world population') + # popRatio runs from 0 to 1, so it is a proportion
  ggtitle('Proportion of world population in continents over time'))
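
Since popRatio is a proportion between 0 and 1, the axis ticks read 0.00 to 1.00. One option, sketched here on toy data with hypothetical values, is to format the ticks as percentages with scales::percent. This is an alternative I am suggesting, not what the plot above does.

```r
library(ggplot2)

# Hypothetical proportions for two continents at two points in time
toy <- data.frame(
  year      = c(1952, 1952, 2007, 2007),
  continent = c("A", "B", "A", "B"),
  popRatio  = c(0.3, 0.7, 0.45, 0.55)
)

p <- ggplot(toy, aes(year, popRatio, fill = continent)) +
  geom_area(position = "stack") +
  scale_y_continuous(labels = scales::percent) +  # show 0.25 as 25%
  ylab("Percentage of world population")
p
```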

We can see from this graph that Asia has the majority of the world’s population, and the Americas’ share hasn’t changed much over the 55 years covered by the data (1952 to 2007). Africa’s share has increased and Europe’s has decreased. Oceania has always held only a very small share.

plotly

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
improved_graph_plotly = improved_graph %>% ggplotly()

Let’s take a look at the new plot:

improved_graph_plotly

The most distinctive thing about the plotly graph is interactivity: I can hover my mouse over a data point and read off its value. This is well suited to HTML output from R Markdown, but not to graphs published in papers, because such a feature is not available on paper or in a PDF.

Also, the interactive graph cannot be checked on GitHub because the document is rendered to Markdown, which does not display the widget.
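
One possible workaround is to save the widget as a standalone HTML file that can be opened in a browser. This is a sketch on a toy plot, assuming the htmlwidgets package is installed (plotly depends on it); the file name example_widget.html is hypothetical.

```r
library(ggplot2)
library(plotly)

# Toy plot standing in for improved_graph
p <- ggplot(data.frame(x = 1:3, y = c(2, 4, 8)), aes(x, y)) +
  geom_line()
widget <- ggplotly(p)

# Write the interactive plot to an HTML file;
# selfcontained = FALSE puts the JavaScript dependencies in a folder next to it
out_file <- file.path(tempdir(), "example_widget.html")
htmlwidgets::saveWidget(widget, out_file, selfcontained = FALSE)
file.exists(out_file)
```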

Part 4: saving figures to file

ggsave('pop_prop_time.png', plot = improved_graph)
## Saving 7 x 5 in image

[Figure: Graph of Population Proportions over time]
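
The message "Saving 7 x 5 in image" shows that ggsave() took the size from the current graphics device. To make the output reproducible, the size and resolution can be set explicitly. A minimal sketch on a toy plot follows; the file name example_plot.png and the chosen dimensions are my own assumptions.

```r
library(ggplot2)

# Toy plot standing in for improved_graph
p <- ggplot(data.frame(x = 1:3, y = c(2, 4, 8)), aes(x, y)) +
  geom_line()

# Fix the size and resolution instead of relying on the graphics device
path <- file.path(tempdir(), "example_plot.png")
ggsave(path, plot = p, width = 7, height = 5, units = "in", dpi = 300)
file.exists(path)
```

Passing plot = explicitly, as done above, also avoids accidentally saving whichever plot was displayed last.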